import numpy as np
import matplotlib.pyplot as plt # To visualize
import pandas as pd # To read data
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from stargazer.stargazer import Stargazer
from IPython.core.display import HTML
import seaborn as sns
NHGIS Data
First, we load the required libraries and data. I am using Python for this because I had trouble getting Stata to work well.
data1 = pd.read_csv('nhgis/3.csv', encoding = "ISO-8859-1")
data2 = pd.read_csv('nhgis/4.csv', encoding = "ISO-8859-1")
Next, we merge the two datasets, create a statistic representing the fraction of the population with a B.A. or higher, and isolate the columns needed for this assignment in the DataFrame.
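Before running a merge like the one below, it can be worth checking that the key column matches cleanly across the two files. A minimal sketch on toy frames (the column names here mimic the NHGIS codes but the values are made up):

```python
import pandas as pd

# Hypothetical miniature stand-ins for the two NHGIS extracts.
left = pd.DataFrame({'GISJOIN': ['G100', 'G200'], 'ALWGE001': [1000, 2000]})
right = pd.DataFrame({'GISJOIN': ['G100', 'G200', 'G300'], 'ALX5E001': [25000, 30000, 18000]})

# indicator=True adds a _merge column flagging rows found in only one file.
check = pd.merge(left, right, on='GISJOIN', how='outer', indicator=True)
print(check.loc[check['_merge'] != 'both', 'GISJOIN'].tolist())  # ['G300']
```

Any key listed there would be silently dropped by the default inner merge.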
data = pd.merge(data1, data2, on='GISJOIN')
data['PctBA'] = (data['ALWGE022'] + data['ALWGE023'] + data['ALWGE024'] + data['ALWGE025']) / data['ALWGE001']
data = data[['GISJOIN', 'CBSA_x', 'CBSAA_x', 'ALWGE001', 'ALX5E001', 'AMEME001', 'PctBA']]
data.columns = ['GISJOIN', 'CBSA_Name', 'CBSA_Code', 'Population', 'Income_Per_Capita', 'Gini', 'Percent_with_BA']
Question G:
We decided not to restrict by size, since the data tables we downloaded were already organized by CBSA, and the smallest CBSA in the dataset already had over 6,000 people, which seemed large enough for reliable estimates.
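Had we chosen to restrict by size, a simple filter would have done it. A sketch on toy data; the 50,000 cutoff and the rows here are purely illustrative:

```python
import pandas as pd

# Illustrative frame with the Population column defined above.
df = pd.DataFrame({'CBSA_Name': ['A', 'B', 'C'],
                   'Population': [6000, 60000, 250000]})

MIN_POP = 50000  # arbitrary example threshold, not one used in the assignment
restricted = df[df['Population'] >= MIN_POP]
print(len(restricted))  # 2
```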
Question H:
Just from sorting the values, the least educated CBSAs appear to have much lower populations than the most educated CBSAs, as well as lower average per capita incomes. The Gini coefficient, however, seems somewhat lower in the less educated CBSAs.
data.sort_values('Percent_with_BA')
| | GISJOIN | CBSA_Name | CBSA_Code | Population | Income_Per_Capita | Gini | Percent_with_BA |
|---|---|---|---|---|---|---|---|
| 658 | G37770 | Pearsall, TX Micro Area; San Antonio-New Braun... | 37770 | 12449 | 19256 | 0.5087 | 0.072857 |
| 173 | G17500 | Clewiston, FL Micro Area; Cape Coral-Fort Myer... | 17500 | 25671 | 19167 | 0.4681 | 0.082973 |
| 704 | G39700 | Raymondville, TX Micro Area; Brownsville-Harli... | 39700 | 13296 | 14888 | 0.4650 | 0.088598 |
| 82 | G13500 | Bennettsville, SC Micro Area | 13500 | 19108 | 17456 | 0.4782 | 0.088916 |
| 828 | G44900 | Summerville, GA Micro Area; Chattanooga-Clevel... | 44900 | 17107 | 18715 | 0.4476 | 0.089320 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 403 | G27060 | Ithaca, NY Metro Area | 27060 | 59774 | 33075 | 0.4960 | 0.535183 |
| 201 | G18700 | Corvallis, OR Metro Area | 18700 | 55359 | 33817 | 0.4801 | 0.541484 |
| 31 | G11460 | Ann Arbor, MI Metro Area | 11460 | 229268 | 41399 | 0.4848 | 0.559271 |
| 105 | G14500 | Boulder, CO Metro Area | 14500 | 212755 | 46826 | 0.4848 | 0.620766 |
| 497 | G31060 | Los Alamos, NM Micro Area; Albuquerque-Santa F... | 31060 | 13260 | 60746 | 0.3909 | 0.674057 |
938 rows × 7 columns
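One way to make the eyeballed pattern concrete is to compare group means at the two ends of the education distribution. A sketch on synthetic rows mimicking the sorted table above (the values are invented, not from the NHGIS data):

```python
import pandas as pd

# Synthetic rows standing in for the merged NHGIS frame.
df = pd.DataFrame({
    'Population': [10000, 12000, 500000, 800000],
    'Income_Per_Capita': [18000, 19000, 45000, 52000],
    'Percent_with_BA': [0.08, 0.09, 0.50, 0.55],
})

ranked = df.sort_values('Percent_with_BA')
bottom, top = ranked.head(2), ranked.tail(2)

# Least-educated rows have lower mean population and income in this toy data.
print(bottom['Income_Per_Capita'].mean() < top['Income_Per_Capita'].mean())  # True
```

On the real data, `head`/`tail` with a larger n (e.g. deciles) would give the same comparison.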
Regression 1
This regression is a simple linear regression of income per capita on total population (city size).
Income_Pop = sm.OLS(data['Income_Per_Capita'], sm.add_constant(data['Population'])).fit()
print(Income_Pop.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Income_Per_Capita R-squared: 0.097
Model: OLS Adj. R-squared: 0.096
Method: Least Squares F-statistic: 100.3
Date: Mon, 11 Oct 2021 Prob (F-statistic): 1.70e-22
Time: 14:06:38 Log-Likelihood: -9467.9
No. Observations: 938 AIC: 1.894e+04
Df Residuals: 936 BIC: 1.895e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.752e+04 199.705 137.821 0.000 2.71e+04 2.79e+04
Population 0.0026 0.000 10.014 0.000 0.002 0.003
==============================================================================
Omnibus: 118.872 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 408.315
Skew: 0.587 Prob(JB): 2.17e-89
Kurtosis: 6.012 Cond. No. 8.16e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.16e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
We observe a statistically significant positive relationship between income per capita and city size.
Next we run a regression of Gini on Population:
Pop_Gini = sm.OLS(data['Gini'], sm.add_constant(data['Population'])).fit()
print(Pop_Gini.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gini R-squared: 0.023
Model: OLS Adj. R-squared: 0.022
Method: Least Squares F-statistic: 21.79
Date: Mon, 11 Oct 2021 Prob (F-statistic): 3.49e-06
Time: 14:06:39 Log-Likelihood: 1956.8
No. Observations: 938 AIC: -3910.
Df Residuals: 936 BIC: -3900.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.4521 0.001 441.058 0.000 0.450 0.454
Population 6.123e-09 1.31e-09 4.667 0.000 3.55e-09 8.7e-09
==============================================================================
Omnibus: 53.772 Durbin-Watson: 1.909
Prob(Omnibus): 0.000 Jarque-Bera (JB): 71.513
Skew: 0.510 Prob(JB): 2.96e-16
Kurtosis: 3.888 Cond. No. 8.16e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.16e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
We get a somewhat confusing result: the coefficient is extremely small, yet its p-value is highly significant. This is a matter of units rather than a weak relationship. Gini coefficients are bounded roughly between 0 and 1, while populations run into the hundreds of thousands, so the per-person slope is necessarily tiny. If we reverse it and use population as the dependent variable, we get the following:
Pop_Gini_Reversed = sm.OLS(data['Population'], sm.add_constant(data['Gini'])).fit()
print(Pop_Gini_Reversed.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Population R-squared: 0.023
Model: OLS Adj. R-squared: 0.022
Method: Least Squares F-statistic: 21.79
Date: Mon, 11 Oct 2021 Prob (F-statistic): 3.49e-06
Time: 14:06:40 Log-Likelihood: -14008.
No. Observations: 938 AIC: 2.802e+04
Df Residuals: 936 BIC: 2.803e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1.461e+06 3.62e+05 -4.038 0.000 -2.17e+06 -7.51e+05
Gini 3.715e+06 7.96e+05 4.667 0.000 2.15e+06 5.28e+06
==============================================================================
Omnibus: 1493.733 Durbin-Watson: 1.978
Prob(Omnibus): 0.000 Jarque-Bera (JB): 684543.407
Skew: 9.648 Prob(JB): 0.00
Kurtosis: 133.930 Cond. No. 39.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Finally, we reach the regression of Gini on income per capita. There is a significant negative relationship between the two variables.
Income_Gini = sm.OLS(data['Gini'], sm.add_constant(data['Income_Per_Capita'])).fit()
print(Income_Gini.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gini R-squared: 0.022
Model: OLS Adj. R-squared: 0.021
Method: Least Squares F-statistic: 20.65
Date: Mon, 11 Oct 2021 Prob (F-statistic): 6.25e-06
Time: 14:06:40 Log-Likelihood: 1956.3
No. Observations: 938 AIC: -3909.
Df Residuals: 936 BIC: -3899.
Df Model: 1
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const 0.4738 0.005 103.256 0.000 0.465 0.483
Income_Per_Capita -7.249e-07 1.6e-07 -4.544 0.000 -1.04e-06 -4.12e-07
==============================================================================
Omnibus: 29.614 Durbin-Watson: 1.892
Prob(Omnibus): 0.000 Jarque-Bera (JB): 34.674
Skew: 0.373 Prob(JB): 2.96e-08
Kurtosis: 3.575 Cond. No. 1.34e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.34e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
stargazer = Stargazer([Income_Pop, Pop_Gini, Income_Gini])
stargazer.significant_digits(4)
stargazer.custom_columns(['Income on Population', 'Gini on Population', 'Gini on Income'], [1, 1, 1])
HTML(stargazer.render_html())
| | Income on Population | Gini on Population | Gini on Income |
|---|---|---|---|
| | (1) | (2) | (3) |
| Income_Per_Capita | | | -0.0000*** |
| | | | (0.0000) |
| Population | 0.0026*** | 0.0000*** | |
| | (0.0003) | (0.0000) | |
| const | 27523.4610*** | 0.4521*** | 0.4738*** |
| | (199.7046) | (0.0010) | (0.0046) |
| Observations | 938 | 938 | 938 |
| R2 | 0.0968 | 0.0227 | 0.0216 |
| Adjusted R2 | 0.0958 | 0.0217 | 0.0205 |
| Residual Std. Error | 5859.5028 (df=936) | 0.0301 (df=936) | 0.0301 (df=936) |
| F Statistic | 100.2890*** (df=1; 936) | 21.7852*** (df=1; 936) | 20.6460*** (df=1; 936) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | |
Above is the table comparing all of the regression results side by side.